Skip to content

Conversation

@fernando-villalba
Copy link
Collaborator

feat(webhook): add mutating and validating admission webhooks

This commit introduces the pkg/webhook module, implementing the Admission Control layer for the Multigres operator. This ensures that MultigresCluster resources are validated for referential integrity and populated with "visible defaults" before persistence.

Key Features:

  • Mutating Webhook (Defaulter): - Injects the System Catalog (postgres database) if missing.

    • Implements "Visible Defaults": Resolves implicit templates (Level 2) and hydrates the Parent Spec so users see the fully realized configuration in Etcd.
    • Respects API Mutual Exclusion: Explicit templateRef values are preserved (not expanded) to strictly adhere to validation rules forbidding both spec and ref on the same component.
    • Uses a request-scoped Resolver to respect Cluster-level TemplateDefaults, mirroring Controller behavior.
  • Validating Webhook:

    • Referential Integrity: Blocks the creation/update of Clusters referencing non-existent Core, Cell, or Shard templates.
    • Deletion Protection: Prevents the deletion of any Template (Core/Cell/Shard) that is currently referenced by an active MultigresCluster, preventing orphaned references.
    • Child Resource Protection: Blocks direct modification of managed child resources (Cells, Shards) by non-operator users.
  • Infrastructure & Testing:

    • Updated cmd/multigres-operator/main.go to wire up the Webhook Server, including CLI flags for cert management and default templates.
    • Updated Makefile to generate MutatingWebhookConfiguration and ValidatingWebhookConfiguration manifests.
    • Added a high-performance Integration Test suite (pkg/webhook/integration_test.go) using a shared envtest environment. This suite covers happy paths, edge cases (override precedence), and concurrent cache consistency (deletion races).

This commit introduces the `pkg/webhook` module, implementing the Admission Control layer for the Multigres operator. This ensures that `MultigresCluster` resources are validated for referential integrity and populated with "visible defaults" before persistence.

Key Features:
- **Mutating Webhook (Defaulter):** - Injects the System Catalog (`postgres` database) if missing.
  - Implements "Visible Defaults": Resolves implicit templates (Level 2) and hydrates the Parent `Spec` so users see the fully realized configuration in Etcd.
  - Respects API Mutual Exclusion: Explicit `templateRef` values are preserved (not expanded) to strictly adhere to validation rules forbidding both `spec` and `ref` on the same component.
  - Uses a request-scoped `Resolver` to respect Cluster-level `TemplateDefaults`, mirroring Controller behavior.

- **Validating Webhook:**
  - **Referential Integrity:** Blocks the creation/update of Clusters referencing non-existent Core, Cell, or Shard templates.
  - **Deletion Protection:** Prevents the deletion of any Template (Core/Cell/Shard) that is currently referenced by an active MultigresCluster, preventing orphaned references.
  - **Child Resource Protection:** Blocks direct modification of managed child resources (Cells, Shards) by non-operator users.

- **Infrastructure & Testing:**
  - Updated `cmd/multigres-operator/main.go` to wire up the Webhook Server, including CLI flags for cert management and default templates.
  - Updated `Makefile` to generate `MutatingWebhookConfiguration` and `ValidatingWebhookConfiguration` manifests.
  - Added a high-performance Integration Test suite (`pkg/webhook/integration_test.go`) using a shared `envtest` environment. This suite covers happy paths, edge cases (override precedence), and concurrent cache consistency (deletion races).
@github-actions

This comment has been minimized.

…itable cert volume

This commit resolves multiple critical issues preventing the Operator's webhook server from starting up successfully in a secure, production-like environment (e.g., Kind).

1. Fix RBAC Permissions:
   The self-signed certificate strategy requires the operator to manage Secrets (for the CA), Services (for the endpoint), and to patch WebhookConfigurations (to inject the CA bundle).
   - Added `// +kubebuilder:rbac` markers to `pkg/webhook/cert/manager.go`.
   - Regenerated `config/rbac/role.yaml` to include permissions for `secrets`, `services`, `mutatingwebhookconfigurations`, and `validatingwebhookconfigurations`.

2. Fix Namespace Mismatch (CrashLoop):
   The operator was crashing because it defaulted to looking for resources in `multigres-system` while running in `multigres-operator`.
   - Updated `config/manager/manager.yaml` to inject the `POD_NAMESPACE` environment variable via the Kubernetes Downward API.
   - Updated `cmd/multigres-operator/main.go` to use `POD_NAMESPACE` as the default for the `--webhook-service-namespace` flag, ensuring the operator effectively "knows where it lives."

3. Fix Read-Only Filesystem Error:
   The deployment enables `readOnlyRootFilesystem: true`, causing the self-signed certificate generator to fail when writing `tls.crt` and `tls.key` to disk.
   - Added an `emptyDir` volume named `cert-dir` in `config/manager/manager.yaml`.
   - Mounted this volume to `/var/run/secrets/webhook` to provide a secure, writable scratch space for ephemeral certificates.

4. Build System:
   - Updated `Makefile` to include `./pkg/webhook/...` in the `controller-gen` object generation paths to ensuring deepcopy methods are generated for webhook-related types.
… resolver scoping

This commit finalizes the "Hybrid Admission Control" implementation by polishing the Webhook Certificate Manager, enforcing "Deep Defaults" (resources/limits) across all components, and fixing critical scoping bugs in the Template Resolver.

Webhook & Certificate Management:
* Auto-Discovery: Removed the manual `--webhook-service-name` flag. The Certificate Manager now automatically discovers the Service targeting the operator's pods (via `control-plane=controller-manager` label) to generate the correct SANs.
* Service Account: Injected `POD_SERVICE_ACCOUNT` env var to allow the webhook to identify the operator's own identity for child-resource validation exemptions.
* Setup: Refactored `webhook.Setup` to use a direct API client for certificate bootstrapping, avoiding cache start-up race conditions.

Resolver & Defaulting Logic:
* Namespace Scoping Bug: Fixed a critical bug in `handlers/defaulter.go` where the Resolver was using the Operator's namespace to look up templates instead of the Request's namespace.
* Deep Defaults: Implemented `DefaultResources*` functions in `resolver/defaults.go` and applied them in `cell.go` and `shard.go`. The Resolver now fully hydrates ResourceRequirements (Requests/Limits) and Storage if missing.
* Smart Defaulting: Updated `resolver/cluster.go` to inject a default `Pool` ("default") when automatically creating the system "0" shard.
* Template Fallbacks: Updated `handlers/validator.go` to skip "NotFound" errors for templates named "default" (Fallback templates), as the Resolver handles these via hardcoded values if they don't exist in the cluster.

Testing & Validation:
* Unit Tests: Updated `resolver/cell_test.go`, `shard_test.go`, and `cluster_test.go` to expect fully hydrated objects (with default resources and replicas) instead of empty structs.
* Integration Tests: Updated `integration_test.go`, `integration_lifecycle_test.go`, and `integration_resolution_enforcement_test.go` to match the new behavior where the controller enforces defaults immediately.
* Cert Manager Tests: Fixed `cert/manager_test.go` to mock the Service and WebhookConfigurations correctly for the new auto-discovery logic.

Dev Infrastructure:
* Samples: Added `config/samples/default-templates/` containing standard Core, Cell, and Shard templates.
* Makefile: Added `kind-deploy-no-webhook` target and a `config/no-webhook` Kustomize profile to allow deploying the operator in a simplified mode for debugging.
* Kustomize: Enabled the webhook component by default in `config/default`.
Refactors the webhook setup to use the controller-runtime builder pattern,
modernizes handlers to implement CustomValidator/Defaulter interfaces, and
overhauls internal certificate generation to be a managed background process.

Detailed Changes:

1. Webhook Registration & Handlers (Refactor)
   - Switched pkg/webhook/setup.go to use ctrl.NewWebhookManagedBy(mgr).
     This replaces manual server.Register calls and eliminates path drift
     between Go code and Kubebuilder markers.
   - Refactored handlers/defaulter.go and handlers/validator.go to
     implement webhook.CustomDefaulter and webhook.CustomValidator interfaces,
     removing manual JSON decoding logic.

2. Internal Certificate Management (Overhaul)
   - Converted pkg/webhook/cert from a static manager to a CertRotator that
     implements manager.Runnable.
   - Added a Bootstrap() phase in main.go to ensure certificates exist
     on disk before the webhook server binds to the port.
   - Implemented auto-detection: checks if certificates exist (e.g., from
     Cert Manager) and skips internal rotation if they do.
   - Added logic to automatically patch Mutating/ValidatingWebhookConfigurations
     with the generated CA bundle on every rotation.
   - Secrets now attempt to set the Operator's Deployment as an owner for
     garbage collection.

3. Configuration & RBAC
   - Updated api/v1alpha1/multigrescluster_types.go with RBAC permissions
     to manage Secrets and patch WebhookConfigurations.
   - Updated main.go to disable caching for Secret and Deployment objects
     to allow bootstrapping before the cache is fully started.
   - Deprecated explicit cert-strategy flags in favor of auto-detection.

This change was smoke tested, the tests need to be fixed, will be done on next commit.
Refactors webhook handlers to use typed interfaces, hardens integration
tests against race conditions, and tunes certificate rotation settings.

Detailed changes:
- tests(tablegroup): wrap updates in RetryOnConflict to fix flaky
  TestTableGroup_Lifecycle caused by controller/test race conditions.
- refactor(webhook): migrate Defaulter and Validator handlers to
  implement controller-runtime's CustomDefaulter and CustomValidator
  interfaces, simplifying logic and fixing build errors.
- fix(webhook/cert): increase internal certificate rotation threshold
  to 30 days (from 24h) to ensure ample recovery time during failures.
- test(webhook/cert): update manager tests to validate DNS names and
  correctly assert rotation behavior with the new threshold.
- style: fix staticcheck ST1005 error string formatting in validator.
…chitecture

This commit completely rewrites the webhook certificate infrastructure to adhere
to strict security and reliability standards for high-availability operators.

Major Architectural Changes:
- **Split-Secret PKI**: Separated the Root CA (stored in `multigres-operator-ca-secret`) from the Server Certificate. The CA key is never mounted to the pod, preventing compromise if the container is breached.
- **ECDSA Cryptography**: Switched from legacy RSA-2048 to ECDSA P-256 for modern, efficient security.
- **Kubelet Volume Projection**: Removed manual file writing. The operator now creates the Secret and waits for the Kubelet to project it to the `readOnly` volume, ensuring the filesystem is always consistent with the API.

Reliability & Operations:
- **Fast Failover**: Enabled `LeaderElectionReleaseOnCancel` in main.go to minimize downtime during rolling upgrades.
- **Conflict-Free Patching**: Switched WebhookConfiguration updates from `Update` to `Patch` to prevent resource conflicts with other controllers.
- **Dynamic Garbage Collection**: Implemented dynamic OwnerReferences using the `app.kubernetes.io/name` label. Secrets are now automatically cleaned up when the Operator Deployment is deleted, regardless of the deployment's name.
- **Observability**: Added Kubernetes Event recording. Certificate generation and rotation now emit `Normal` events for better administrator visibility.

Fixes:
- Fixed `manager.yaml` to use a `secret` volume mount instead of `emptyDir`, allowing proper bootstrapping.
- Fixed `validator_test.go` assertion to match lowercase error message standards.
- Fixed `manager_test.go` hang by mocking Kubelet file projection.
This commit introduces a dedicated Kustomize overlay and Makefile tooling to support
deploying the operator with Cert-Manager for certificate provisioning, alongside the
existing self-signed mode.

Changes:
- **Base Configuration**: Added `config/certmanager` containing the `ClusterIssuer` and `Certificate` definitions required to request a certificate.
- **Deployment Overlay**: Added `config/deploy-certmanager` which:
  - Inherits from the default configuration.
  - Patches the Deployment to mount the Cert-Manager generated Secret (`webhook-server-cert`).
  - Sets `optional: false` on the secret volume to prevent the Pod from starting before the certificate is ready (fixing potential race conditions).
  - Configures `replacements` to automatically inject the CA Bundle annotation into WebhookConfigurations.
- **Automation**: Updated `Makefile` with:
  - `install-certmanager`: Downloads and installs Cert-Manager v1.19.2 and waits for readiness.
  - `kind-deploy-certmanager`: Orchestrates the full deployment (Install CM -> Apply CRDs -> Deploy Operator with Overlay).
Currently, the operator acts as a "black box," reconciling resources silently. Users must tail operator logs to debug simple configuration errors (like missing templates) or to understand why a resource is stuck in termination.

This commit integrates the Kubernetes `EventRecorder` into the `MultigresCluster` and `TableGroup` controllers to provide first-class observability via `kubectl describe`.

Changes:
- **RBAC:** Added `create` and `patch` permissions for `events` in `role.yaml` and added corresponding Kubebuilder markers to `multigrescluster_controller.go`.
- **Wiring:** Injected `mgr.GetEventRecorderFor(...)` into the Cluster and TableGroup controllers in `main.go`.
- **MultigresCluster Controller:**
  - Emits `Warning/TemplateMissing` immediately if Core, Cell, or Shard templates cannot be resolved.
  - Emits `Warning/ConfigError` if Shard resolution fails.
  - Emits `Normal/Created` events when initializing Global Topo, MultiAdmin, Cells, and TableGroups.
  - Emits `Normal/Deleted` events when pruning orphaned Cells or TableGroups.
  - Emits `Normal/Cleanup` events when the deletion finalizer is waiting for child resources to terminate.
- **TableGroup Controller:**
  - Emits `Normal/Created` when creating new Shard resources.
  - Emits `Normal/Deleted` when pruning orphaned Shards.
- **Tests:**
  - Updated all unit tests to inject `record.NewFakeRecorder(100)` to prevent nil pointer panics.
  - Added explicit event verification logic to the test runner.
  - Added new test cases to verify event emission for orphan pruning and blocking deletion scenarios.

Benefits:
- **Better UX:** Users receive immediate feedback on configuration errors directly in the CLI.
- **Traceability:** Provides a clear audit trail of resource lifecycle events (creation, deletion, pruning) driven by the operator.
@fernando-villalba fernando-villalba marked this pull request as ready for review January 12, 2026 18:30
@rytswd
Copy link
Member

rytswd commented Jan 12, 2026

Can you fix the merge conflict?

…iguration

Reworks the MultigresCluster webhook defaulting strategy to balance immediate
configuration visibility with dynamic template resolution.

Changes:
- **Condition-Based Defaulting**: The Defaulter (pkg/webhook/handlers/defaulter.go)
  now strictly checks for template usage before materializing defaults.
    - If Explicit Templates (Inline or Global) are used -> Skip materialization.
    - If Implicit "default" Template exists -> Promote to Explicit Ref -> Skip materialization.
    - If NO templates exist -> Materialize hardcoded defaults (Static/Visible).
- **Implicit Promotion**: Automatically detects implicit "default" templates and
  updates `spec.templateDefaults` to explicitly reference them. This ensures the
  user is aware that a template is driving the configuration.
- **Fix Phantom Defaults**: Removed logic in pkg/resolver/cluster.go that blindly
  defaulted `templateDefaults` to "default". This prevents the system from
  claiming a template is in use when it isn't.
- **Resolver Helpers**: Added Core/Cell/ShardTemplateExists helpers to
  pkg/resolver/resolver.go to enable lightweight existence checks without
  full object resolution.

Benefits:
- **Correctness**: Minimal clusters receive visible, safe defaults; templated
  clusters remain linked to their templates.
- **Visibility**: The `spec` always reflects the source of truth—either a full
  materialized config or a clear pointer to the template.
- **Dynamism**: Preserves the ability for the Controller to dynamically reconcile
  updates from templates rather than snapshotting them at creation time.
…ertManager

This commit introduces significant improvements to the operator's lifecycle management, default resolution logic, and self-hosted certificate infrastructure.

1. Smart Defaulting & Resolver Logic (Zero Config vs. Templates)
- Context: Addresses the "Invisible Defaults" problem outlined in admission-controller-and-defaults.
- Change: Updated PopulateClusterDefaults in pkg/resolver/cluster.go to be context-aware.
- Logic: The resolver now checks for the existence of an implicit "default" ShardTemplate in the namespace.
    - If a user provides an explicit template -> No defaults injected.
    - If an implicit "default" template exists -> No defaults injected (respects the user's template).
    - If neither exists -> The resolver injects "Zero Config" hardcoded defaults (e.g., a readWrite pool in the "default" shard).
- Benefit: Prevents the operator from overwriting user intent when they define custom templates, while still allowing a "one-click" experience for new users.

2. TableGroup Lifecycle Management (Finalizers)
- Context: Ensures the hierarchy defined in the API v1alpha1 design is respected during deletion.
- Change: Added a Finalizer (tablegroup.multigres.com/finalizer) to the TableGroup controller.
- Logic: Implemented handleDelete to block TableGroup deletion until all owned Shard resources are successfully terminated and removed.
- Benefit: Prevents orphaned resources and ensures ordered teardown of the database topology.

3. Hardened Certificate Management (Strategy C)
- Context: Improves the "In-Process Operator Generation" strategy.
- Change: Heavily refactored pkg/webhook/cert.
    - Rotation: Added support for configurable rotation intervals and logic to handle expiring CAs and Server Certs.
    - Resilience: Added logic to detect and recreate corrupt keys/certs or those signed by unknown CAs.
    - Discovery: Improved findOperatorDeployment to support looking up the operator Deployment by Label Selector (preferred) in addition to Name.
    - Testing: Refactored generator code to allow mocking of x509 functions for deterministic failure testing.

4. Testing & Validation
- Failure Injection: Integrated testutil.NewFakeClientWithFailures across controller and resolver tests to simulate API server errors (e.g., during template lookups or finalizer updates).
- Webhook Setup: Added pkg/webhook/setup_test.go to verify manager registration resilience.
- Validator: Expanded validator_test.go to cover all template existence checks and permission validations (Read-Only child resources).
- Integration: Updated integration_resolution_enforcement_test.go to match the new smart defaulting expectations (expecting specific pools rather than empty maps).

5. Housekeeping
- Gitignore: Added *cover.out to ignore list.
@github-actions

This comment has been minimized.

@github-actions
Copy link

🔬 Go Test Coverage Report

Summary

Coverage Type Result
Threshold 0%
Previous Test Coverage Unknown%
New Test Coverage Unknown%

Status

⚠️ SKIPPED (Build Failed?)

Detail

Show New Coverage
No coverage report generated (Build likely failed).

@fernando-villalba
Copy link
Collaborator Author

I will merge this and fix the module imports for main straight after (new webhook module)

@fernando-villalba fernando-villalba merged commit 44b30bf into main Jan 15, 2026
1 of 3 checks passed
@fernando-villalba fernando-villalba deleted the webhook branch January 15, 2026 20:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants